Abstract

Many learning tasks can be viewed as sequence prediction problems. For
example, online classification can be converted to sequence prediction with
the sequence being pairs of input/target data and where the goal is to
correctly predict the target data given input data and previous input/target
pairs. Solomonoff induction is known to solve the general sequence
prediction problem, but only if the entire sequence is sampled from a
computable distribution. In the case of classification and discriminative
learning though, only the targets need be structured (given the inputs). We
show that the normalised version of Solomonoff induction can still be used in
this case, and more generally that it can detect any recursive sub-pattern
(regularity) within an otherwise completely unstructured sequence. It is also
shown that the unnormalised version can fail to predict very simple recursive
sub-patterns.

1 Introduction

The sequence prediction problem is the task of predicting the next symbol, $x_n$, after
observing $x_1 x_2 \cdots x_{n-1}$. Solomonoff induction [Sol64a, Sol64b] solves this problem
by taking inspiration from Occam's razor and Epicurus' principle of multiple
explanations. These ideas are formalised in the field of Kolmogorov complexity, in
particular by the universal a priori semi-measure $M$.

Let $\mu(x_n | x_1 \cdots x_{n-1})$ be the true (unknown) probability of seeing $x_n$ having
already observed $x_1 \cdots x_{n-1}$. The celebrated result of Solomonoff [Sol64a] states that
if $\mu$ is computable then

$$\lim_{n\to\infty} \left[ M(x_n | x_1 \cdots x_{n-1}) - \mu(x_n | x_1 \cdots x_{n-1}) \right] = 0 \quad \text{with } \mu\text{-probability } 1 \qquad (1)$$

That is, M can learn the true underlying distribution from which the data is sampled 
with probability 1. Solomonoff induction is arguably the gold standard predictor, 
universally solving many (passive) prediction problems [Hut04, Hut07, Sol64a]. 

However, Solomonoff induction makes no guarantees if $\mu$ is not computable. This
would not be problematic if it were unreasonable to predict sequences sampled from
incomputable $\mu$, but this is not the case. Consider the sequence below, where every
even bit is the same as the preceding odd bit, but where the odd bits may be chosen
arbitrarily.

00 11 11 11 00 11 00 00 00 11 11 00 00 00 00 00 11 11 (2) 

Any child will quickly learn the pattern that each even bit is the same as the pre- 
ceding odd bit and will correctly predict the even bits. If Solomonoff induction is to 
be considered a truly intelligent predictor then it too should be able to predict the 
even bits. More generally, it should be able to detect any computable sub-pattern. 
It is this question, first posed in [Hut04, Hut09] and resisting attempts by experts 
for 6 years, that we address. 

At first sight, this appears to be an esoteric question, but consider the following
problem. Suppose you are given a sequence of pairs, $x_1 y_1 x_2 y_2 x_3 y_3 \cdots$ where $x_i$ is
the data for an image (or feature vector) of a character and $y_i$ the corresponding
ascii code (class label) for that character. The goal of online classification is to
construct a predictor that correctly predicts $y_i$ given $x_i$ based on the previously seen
training pairs. It is reasonable to assume that there is a relatively simple pattern
to generate $y_i$ given $x_i$ (humans and computers seem to find simple patterns for
character recognition). However it is not necessarily reasonable to assume there
exists a simple, or even computable, underlying distribution generating the training
data $x_i$. This problem is precisely what gave rise to discriminative learning [LS06].

It turns out that there exist sequences with even bits equal to preceding odd
bits on which the conditional distribution of $M$ fails to converge to 1 on the even
bits. On the other hand, it is known that $M$ is a defective measure, but may be
normalised to a proper measure, $M_{norm}$. We show that this normalised version does
converge on any recursive sub-pattern of any sequence, such as that in Equation (2).
This outcome is unanticipated since (all?) other results in the field are independent
of normalisation [Hut04, Hut07, LV08, Sol64a]. The proofs are completely different
to the standard proofs of predictive results.

2 Notation and Definitions 

We use similar notation to [Gac83, Gac08, Hut04]. For a more comprehensive in- 
troduction to Kolmogorov complexity and Solomonoff induction see [Hut04, Hut07, 
LV08, ZL70]. 

Strings. A finite binary string $x$ is a finite sequence $x_1 x_2 x_3 \cdots x_n$ with $x_i \in \mathbb{B} =
\{0, 1\}$. Its length is denoted $\ell(x)$. An infinite binary string $\omega$ is an infinite sequence
$\omega_1 \omega_2 \omega_3 \cdots$. The empty string of length zero is denoted $\epsilon$. $\mathbb{B}^n$ is the set of all binary
strings of length $n$. $\mathbb{B}^*$ is the set of all finite binary strings. $\mathbb{B}^\infty$ is the set of all infinite
binary strings. Substrings are denoted $x_{s:t} := x_s x_{s+1} \cdots x_{t-1} x_t$ where $s, t \in \mathbb{N}$ and
$s \leq t$. If $s > t$ then $x_{s:t} = \epsilon$. A useful shorthand is $x_{<t} := x_{1:t-1}$. Strings may be
concatenated. Let $x, y \in \mathbb{B}^*$ of length $n$ and $m$ respectively. Let $\omega \in \mathbb{B}^\infty$. Then,

$$xy := x_1 x_2 \cdots x_{n-1} x_n y_1 y_2 \cdots y_{m-1} y_m$$

$$x\omega := x_1 x_2 \cdots x_{n-1} x_n \omega_1 \omega_2 \omega_3 \cdots$$

For $b \in \mathbb{B}$, $\neg b = 0$ if $b = 1$ and $\neg b = 1$ if $b = 0$. We write $x \sqsubseteq y$ if $x$ is a prefix of $y$.
Formally, $x \sqsubseteq y$ if $\ell(x) \leq \ell(y)$ and $x_i = y_i$ for all $1 \leq i \leq \ell(x)$. $x \sqsubset y$ if $x \sqsubseteq y$ and
$\ell(x) < \ell(y)$.

Complexity. Here we give a brief introduction to Kolmogorov complexity and the 
associated notation. 

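Definition 1 (Inequalities). Let $f, g$ be real valued functions. We write $f(x) \stackrel{\times}{\geq} g(x)$
if there exists a constant $c > 0$ such that $f(x) \geq c \cdot g(x)$ for all $x$. $f(x) \stackrel{\times}{\leq} g(x)$ is
defined similarly. $f(x) \stackrel{\times}{=} g(x)$ if $f(x) \stackrel{\times}{\leq} g(x)$ and $f(x) \stackrel{\times}{\geq} g(x)$.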

Definition 2 (Measures). We call $\mu : \mathbb{B}^* \to [0, 1]$ a semimeasure if $\mu(x) \geq
\sum_{b \in \mathbb{B}} \mu(xb)$ for all $x \in \mathbb{B}^*$, and a probability measure if equality holds and $\mu(\epsilon) = 1$.
$\mu(x)$ is the $\mu$-probability that a sequence starts with $x$. $\mu(b|x) := \frac{\mu(xb)}{\mu(x)}$ is the
probability of observing $b \in \mathbb{B}$ given that $x \in \mathbb{B}^*$ has already been observed. A function
$P : \mathbb{B}^* \to [0, 1]$ is a semi-distribution if $\sum_{x \in \mathbb{B}^*} P(x) \leq 1$ and a probability
distribution if equality holds.

Definition 3 (Enumerable Functions). A real valued function $f : A \to \mathbb{R}$ is
enumerable if there exists a computable function $f : A \times \mathbb{N} \to \mathbb{Q}$ satisfying
$\lim_{t\to\infty} f(a, t) = f(a)$ and $f(a, t + 1) \geq f(a, t)$ for all $a \in A$ and $t \in \mathbb{N}$.



Definition 4 (Machines). A Turing machine $L$ is a recursively enumerable
set (which may be finite) containing pairs of finite binary strings
$(p^1, y^1), (p^2, y^2), (p^3, y^3), \cdots$.

$L$ is a prefix machine if the set $\{p^1, p^2, p^3, \cdots\}$ is prefix free (no program is a prefix
of any other). It is a monotone machine if for all $(p, y), (q, x) \in L$ with $\ell(x) \geq \ell(y)$,
$p \sqsubseteq q \Rightarrow y \sqsubseteq x$.

We define $L(p)$ to be the set of strings output by program $p$. This is different
for monotone and prefix machines. For prefix machines, $L(p)$ contains only one
element, $y \in L(p)$ if $(p, y) \in L$. For monotone machines, $y \in L(p)$ if there exists
$(p, x) \in L$ with $y \sqsubseteq x$ and there does not exist a $(q, z) \in L$ with $q \sqsubset p$ and
$y \sqsubseteq z$. For both machines $L(p)$ represents the output of machine $L$ when given
input $p$. If $L(p)$ does not exist then we say $L$ does not halt on input $p$. Note that for
monotone machines it is possible for the same program to output multiple strings.
For example $(1, 1), (1, 11), (1, 111), (1, 1111), \cdots$ is a perfectly legitimate monotone
Turing machine. For prefix machines this is not possible. Also note that if $L$ is a
monotone machine and there exists an $x \in \mathbb{B}^*$ such that $x_{1:n} \in L(p)$ and $x_{1:m} \in L(p)$
then $x_{1:r} \in L(p)$ for all $n \leq r \leq m$.
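
The definitions above can be checked mechanically on finite toy machines. The following is a minimal sketch, assuming a machine is given as a finite Python list of (program, output) string pairs; the helper names `is_prefix_free`, `is_monotone` and `outputs` are hypothetical and chosen only for illustration. A real machine is an infinite recursively enumerable set, so this only illustrates the conditions of Definition 4 and the definition of $L(p)$ for monotone machines.

```python
from itertools import combinations


def is_prefix(x: str, y: str) -> bool:
    """x is a (not necessarily proper) prefix of y."""
    return y.startswith(x)


def is_prefix_free(programs) -> bool:
    """True if no program is a proper prefix of another program."""
    distinct = set(programs)
    return not any(is_prefix(p, q) or is_prefix(q, p)
                   for p, q in combinations(distinct, 2))


def is_monotone(machine) -> bool:
    """Check: for all (p, y), (q, x) with len(x) >= len(y),
    p a prefix of q implies y a prefix of x."""
    for p, y in machine:
        for q, x in machine:
            if len(x) >= len(y) and is_prefix(p, q) and not is_prefix(y, x):
                return False
    return True


def outputs(machine, p: str) -> set:
    """L(p) for a finite monotone machine: every y that is a prefix of some
    x with (p, x) in the machine, unless a proper prefix q of p already
    produces an output extending y."""
    result = set()
    for prog, x in machine:
        if prog != p:
            continue
        for k in range(len(x) + 1):
            y = x[:k]
            produced_earlier = any(q != p and is_prefix(q, p) and is_prefix(y, z)
                                   for q, z in machine)
            if not produced_earlier:
                result.add(y)
    return result


# A finite piece of the monotone machine (1, 1), (1, 11), (1, 111), ...
toy = [("1", "1"), ("1", "11"), ("1", "111")]
print(is_monotone(toy), sorted(outputs(toy, "1")))
```

Note that under this reading the empty string always belongs to $L(p)$ for a program with no shorter prefix in the machine, since it is a prefix of every output.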

Definition 5 (Complexity). Let $L$ be a prefix or monotone machine then define

$$\lambda_L(y) := \sum_{p : y \in L(p)} 2^{-\ell(p)} \qquad C_L(y) := \min_{p \in \mathbb{B}^*} \{\ell(p) : y \in L(p)\}$$

If $L$ is a prefix machine then we write $m_L(y) \equiv \lambda_L(y)$. If $L$ is a monotone machine
then we write $M_L(y) \equiv \lambda_L(y)$. Note that if $L$ is a prefix machine then $\lambda_L$ is an
enumerable semi-distribution while if $L$ is a monotone machine, $\lambda_L$ is an enumerable
semi-measure. In fact, every enumerable semi-measure (or semi-distribution) can be
represented via some machine $L$ as $\lambda_L$.
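
As a small worked instance of Definition 5, the sketch below evaluates $\lambda_L$ and $C_L$ for a finite prefix machine. The machine `toy` and the function names `lam` and `complexity` are assumptions made for illustration; for a prefix machine each program has exactly one output, so $y \in L(p)$ reduces to an equality test.

```python
from fractions import Fraction


def lam(machine, y: str) -> Fraction:
    """lambda_L(y): sum of 2^(-len(p)) over programs p whose output is y."""
    return sum((Fraction(1, 2 ** len(p)) for p, out in machine if out == y),
               Fraction(0))


def complexity(machine, y: str):
    """C_L(y): length of a shortest program for y, or None if there is none."""
    lengths = [len(p) for p, out in machine if out == y]
    return min(lengths) if lengths else None


# Hypothetical finite prefix machine: programs 0, 10, 110 are prefix free.
toy = [("0", "11"), ("10", "11"), ("110", "0")]

print(lam(toy, "11"), complexity(toy, "11"))
# Summing lambda_L over all outputs stays below 1 (a semi-distribution).
print(sum(lam(toy, y) for y in {out for _, out in toy}))
```

Exact rationals (`Fraction`) are used so that the Kraft-style bound can be compared without floating point error.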

For prefix/monotone machine $L$ we write $L_t$ for the first $t$ program/output pairs
in the recursive enumeration of $L$, so $L_t$ will be a finite set containing at most $t$
pairs.$^1$

The set of all monotone (or prefix) machines is itself recursively enumerable
[LV08],$^2$ which allows one to define a universal monotone machine $U_M$ as follows.
Let $L^i$ be the $i$th monotone machine in the recursive enumeration of monotone
machines.

$$(i'p, y) \in U_M \Leftrightarrow (p, y) \in L^i$$

where $i'$ is a prefix coding of the integer $i$. A universal prefix machine, denoted $U_P$,
is defined in a similar way. For details see [LV08].
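
To make the phrase "prefix coding of the integer $i$" concrete, here is a minimal sketch of one standard self-delimiting code, the Elias gamma code. This is an assumption for illustration and not necessarily the coding used by the authors or in [LV08]; the function names `encode` and `decode` are hypothetical. The code word for $n \geq 1$ has length $2\lfloor \log_2 n \rfloor + 1$, in line with the roughly $2 \log n$ bits mentioned in the paper's footnote on encoding integers.

```python
def encode(n: int) -> str:
    """Elias gamma code of n >= 1: (len(binary) - 1) zeros, then binary(n)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def decode(stream: str):
    """Read one code word from the front of stream.
    Returns the decoded integer and the unread remainder."""
    zeros = 0
    while stream[zeros] == "0":
        zeros += 1
    end = 2 * zeros + 1
    return int(stream[zeros:end], 2), stream[end:]


# A universal machine can split an input i'p into the index i and program p.
index, program = decode(encode(5) + "111")
print(index, program)
```

Because no code word is a prefix of another, the decoder knows where $i'$ ends without any separator, which is what allows $i'p$ to be parsed unambiguously.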



1 $L_t$ will contain exactly $t$ pairs unless $L$ is finite, in which case it will contain $t$ pairs until $t$ is
greater than the size of $L$. This annoyance will never be problematic.

2 Note the enumeration may include repetition, but this is unimportant in this case. 



Theorem 6 (Universal Prefix/Monotone Machines). For the universal monotone
machine $U_M$ and universal prefix machine $U_P$,

$$m_{U_P}(y) \geq c_L\, m_L(y) \text{ for all } y \in \mathbb{B}^* \qquad M_{U_M}(y) \geq c_L\, M_L(y) \text{ for all } y \in \mathbb{B}^*$$

where $c_L > 0$ depends on $L$ but not $y$.

For a proof, see [LV08]. As usual, we will fix reference universal prefix/monotone
machines $U_P$, $U_M$ and drop the subscripts by letting,

$$m(y) := m_{U_P}(y) = \sum_{p : y \in U_P(p)} 2^{-\ell(p)} \qquad M(y) := M_{U_M}(y) = \sum_{p : y \in U_M(p)} 2^{-\ell(p)}$$

$$K(y) := C_{U_P}(y) = \min_{p \in \mathbb{B}^*} \{\ell(p) : y \in U_P(p)\} \qquad Km(y) := \min_{p \in \mathbb{B}^*} \{\ell(p) : y \in U_M(p)\}$$

The choice of reference universal Turing machine is usually$^3$ unimportant since a
different choice varies $m$, $M$ by only a multiplicative constant, while $K$, $Km$ are
varied by additive constants. For natural numbers $n$ we define $K(n)$ by $K(\langle n \rangle)$
where $\langle n \rangle$ is the binary representation of $n$.

$M$ is not a proper measure, $M(x) > M(x0) + M(x1)$ for all $x \in \mathbb{B}^*$, which means
that $M(0|x) + M(1|x) < 1$, so $M$ assigns a non-zero probability that the sequence
will end. This is because there are monotone programs $p$ that halt, or enter infinite
loops. For this reason Solomonoff introduced a normalised version, $M_{norm}$, defined
as follows.

Definition 7 (Normalisation).

$$M_{norm}(\epsilon) := 1, \qquad M_{norm}(y_n | y_{<n}) \equiv \frac{M_{norm}(y_{1:n})}{M_{norm}(y_{<n})} := \frac{M(y_{1:n})}{M(y_{<n}0) + M(y_{<n}1)}.$$
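
Since $M$ itself is incomputable, the effect of Definition 7 can only be illustrated on a stand-in. The sketch below applies the same normalisation to a hypothetical computable semimeasure `mu` (an assumption chosen for illustration, with $\mu(x0) + \mu(x1) = \frac{3}{4}\mu(x)$) and shows that the normalised conditionals sum to one.

```python
from fractions import Fraction

# Hypothetical defective semimeasure: each '0' contributes 1/2, each '1' 1/4,
# so a quarter of the mass is lost at every step.
WEIGHT = {"0": Fraction(1, 2), "1": Fraction(1, 4)}


def mu(x: str) -> Fraction:
    """Toy semimeasure: product of per-bit weights, with mu('') = 1."""
    result = Fraction(1)
    for bit in x:
        result *= WEIGHT[bit]
    return result


def cond(x: str, b: str) -> Fraction:
    """Unnormalised conditional mu(b | x) = mu(xb) / mu(x)."""
    return mu(x + b) / mu(x)


def cond_norm(x: str, b: str) -> Fraction:
    """Normalised conditional: mu(xb) / (mu(x0) + mu(x1))."""
    return mu(x + b) / (mu(x + "0") + mu(x + "1"))


def mu_norm(x: str) -> Fraction:
    """Normalised measure as a product of normalised conditionals."""
    result = Fraction(1)
    for n in range(len(x)):
        result *= cond_norm(x[:n], x[n])
    return result


x = "01"
print(cond(x, "0") + cond(x, "1"))            # strictly less than 1
print(cond_norm(x, "0") + cond_norm(x, "1"))  # exactly 1
```

The normalised version is a proper measure by construction: `mu_norm(x + "0") + mu_norm(x + "1")` equals `mu_norm(x)` for every `x`.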



This normalisation is not unique, but is philosophically and technically the most 
attractive and was used and defended by Solomonoff. Historically, most researchers 
have accepted the defective M for technical convenience. As mentioned, the differ- 
ence seldom matters, but in this paper it is somewhat surprisingly crucial. For a 
discussion of normalisation, see [LV08] . 

Theorem 8. The following are results in Kolmogorov complexity. Proofs for all can
be found in [LV08].

1. $m(x) \stackrel{\times}{=} 2^{-K(x)}$

2. $2^{-K(xb)} \stackrel{\times}{=} 2^{-K(x \neg b)}$

3. $M(x) \stackrel{\times}{\geq} m(x)$



3 See [HM07] for a subtle exception. All the results in this paper are independent of universal 
Turing machine. 



4. If $P$ is an enumerable semi-distribution, then $m(y) \stackrel{\times}{\geq} P(y)$

5. If $\mu$ is an enumerable semi-measure, then $M(y) \stackrel{\times}{\geq} \mu(y)$

Note the last two results are equivalent to Theorem 6 since every enumerable 
semi- (measure/distribution) is generated by a monotone/prefix machine in the sense 
of Theorem 6 and vice- versa. 

Before proceeding to our own theorems we need a recently proven result in
algorithmic information theory.

Theorem 9. [Lempp, Miller, Ng and Turetsky, 2010, unpublished, private
communication] $\lim_{n\to\infty} \frac{m(\omega_{<n})}{M(\omega_{<n})} = 0$, for all $\omega \in \mathbb{B}^\infty$.



3 $M_{norm}$ Predicts Selected Bits

The following Theorem is the main positive result of this paper. It shows that any
computable sub-pattern of a sequence will eventually be predicted by $M_{norm}$.

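Theorem 10. Let $f : \mathbb{B}^* \to \mathbb{B} \cup \{\epsilon\}$ be a total recursive function and $\omega \in \mathbb{B}^\infty$
satisfying $f(\omega_{<n}) = \omega_n$ whenever $f(\omega_{<n}) \neq \epsilon$. If $f(\omega_{<n_i}) \neq \epsilon$ is defined for an
infinite sequence $n_1, n_2, n_3, \cdots$ then $\lim_{i\to\infty} M_{norm}(\omega_{n_i} | \omega_{<n_i}) = 1$.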

Essentially the Theorem is saying that if there exists a computable predictor $f$
that correctly predicts the next bit every time it tries (i.e. when $f(\omega_{<n}) \neq \epsilon$) then
$M_{norm}$ will eventually predict the same bits as $f$. By this we mean that if you
constructed a predictor $f_{M_{norm}}$ defined by $f_{M_{norm}}(\omega_{<n}) = \arg\max_{b \in \mathbb{B}} M_{norm}(b | \omega_{<n})$,
then there exists an $N$ such that $f_{M_{norm}}(\omega_{<n}) = f(\omega_{<n})$ for all $n > N$ where
$f(\omega_{<n}) \neq \epsilon$. For example, let $f$ be defined by

$$f(x) = \begin{cases} x_{\ell(x)} & \text{if } \ell(x) \text{ odd} \\ \epsilon & \text{otherwise} \end{cases}$$

Now if $\omega \in \mathbb{B}^\infty$ satisfies $\omega_{2n} = f(\omega_{<2n}) = \omega_{2n-1}$ for all $n \in \mathbb{N}$ then Theorem 10 shows
that $\lim_{n\to\infty} M_{norm}(\omega_{2n} | \omega_{<2n}) = 1$. It says nothing about the predictive qualities of
$M_{norm}$ on the odd bits, on which there are no restrictions.
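
The example predictor is simple enough to run directly. The sketch below implements $f$ on Python strings, with the empty string standing in for $\epsilon$ (an encoding choice made here, not part of the paper), and counts how often it predicts and errs along a finite prefix such as sequence (2). The helper name `check` is hypothetical.

```python
def f(x: str) -> str:
    """Predict the last bit of x if len(x) is odd; otherwise return ''
    (the empty string plays the role of epsilon, i.e. no prediction)."""
    return x[-1] if len(x) % 2 == 1 else ""


def check(omega: str):
    """Return (number of predictions, number of mistakes) of f on omega."""
    attempts = 0
    mistakes = 0
    for n in range(1, len(omega) + 1):
        guess = f(omega[:n - 1])          # f applied to omega_{<n}
        if guess != "":
            attempts += 1
            if guess != omega[n - 1]:     # compare with omega_n
                mistakes += 1
    return attempts, mistakes


# Sequence (2) from the introduction, with the spaces removed.
omega = "00 11 11 11 00 11 00 00 00 11 11 00 00 00 00 00 11 11".replace(" ", "")
print(check(omega))
```

On this prefix $f$ predicts at every even position and is never wrong, which is exactly the hypothesis Theorem 10 places on the pair $(f, \omega)$.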

The proof essentially relies on using $f$ to show that monotone programs for
$\omega_{<n_i} \neg\omega_{n_i}$ can be converted to prefix programs. This is then used to show that
$M(\omega_{<n_i} \neg\omega_{n_i}) \stackrel{\times}{=} m(\omega_{<n_i} \neg\omega_{n_i})$. The result will then follow from Theorem 9.

Theorem 10 insists that $f$ be totally recursive and that $f(\omega_{<n}) = \epsilon$ if $f$ refrains
from predicting. One could instead allow $f$ to be partially recursive and simply not
halt to avoid making a prediction. The proof below breaks down in this case and we
suspect that Theorem 10 will become invalid if $f$ is permitted to be only partially
recursive.



Proof of Theorem 10. We construct a machine $L$ from $U_M$ consisting of all programs
that produce output that $f$ would not predict. We then show that these programs
essentially form a prefix machine. Define $L$ by the following process

1. $L := \emptyset$ and $t := 1$.

2. Let $(p, y)$ be the $t$th pair in $U_M$.

3. Let $i$ be the smallest natural number such that $y_i \neq f(y_{<i}) \neq \epsilon$. That is, $i$ is
the position at which $f$ makes its first mistake when predicting $y$. If $f$ makes
no prediction errors then $i$ doesn't exist.$^4$

4. If $i$ exists then $L := L \cup \{(p, y_{1:i})\}$ (Note that we do not allow $L$ to contain
duplicates).

5. $t := t + 1$ and go to step 2.
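
The process above can be sketched on a finite stand-in for the enumeration of $U_M$. In the code below, `enumeration` is a hypothetical finite list of (program, output) pairs, `predict_even` is the example predictor from earlier in this section, and `first_mistake` and `build_L` are illustrative names. The real construction runs forever over an infinite enumeration; this sketch only mirrors steps 2 to 4 for each pair.

```python
def predict_even(x: str) -> str:
    """Example total recursive predictor: last bit if len(x) is odd, else ''."""
    return x[-1] if len(x) % 2 == 1 else ""


def first_mistake(f, y: str):
    """Smallest i (1-indexed) with f(y_{<i}) != '' and f(y_{<i}) != y_i,
    or None if f makes no prediction error on y (step 3)."""
    for i in range(1, len(y) + 1):
        guess = f(y[:i - 1])
        if guess != "" and guess != y[i - 1]:
            return i
    return None


def build_L(enumeration, f):
    """Steps 2 to 4 applied to each pair of a finite enumeration."""
    L = []
    for p, y in enumeration:
        i = first_mistake(f, y)
        if i is not None and (p, y[:i]) not in L:   # no duplicates in L
            L.append((p, y[:i]))                    # add (p, y_{1:i})
    return L


# Hypothetical finite enumeration standing in for U_M.
enumeration = [("0", "0011"), ("1", "0010"), ("11", "00101"), ("01", "01")]
print(build_L(enumeration, predict_even))
```

The search in `first_mistake` always terminates because the predictor is total, which is the property the proof relies on to make $L$ recursively enumerable.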

Since $f$ is totally recursive and $U_M$ is recursively enumerable, the process above
shows that $L$ is recursively enumerable. It is easy to see that $L$ is a monotone
machine. Further, if $(p, y), (q, x) \in L$ with $p \sqsubseteq q$ then $y = x$. This follows since by
monotonicity we would have that $y \sqsubseteq x$, but $f(x_{<\ell(y)}) = f(y_{<\ell(y)}) \neq y_{\ell(y)} = x_{\ell(y)}$
and by steps 3 and 4 in the process above we have that $\ell(x) = \ell(y)$.

Recall that $L_t$ is the $t$th enumeration of $L$ and contains $t$ elements. Define
$\bar{L}_t \subseteq L_t$ to be the largest prefix free set of shortest programs. Formally, $(p, y) \in
\bar{L}_t$ if there does not exist a $(q, x) \in L_t$ such that $q \sqsubset p$. For example, if $L_t =
(1, 001), (11, 001), (01, 11110), (010, 11110)$ then $\bar{L}_t = (1, 001), (01, 11110)$. If we now
added $(0, 11110)$ to $L_t$ to construct $L_{t+1}$ then $\bar{L}_{t+1}$ would be $(1, 001), (0, 11110)$.

Since $L_t$ is finite, $\bar{L}_t$ is easily computable from $L_t$. Therefore the following
function is computable.

$$P(y, t) := \sum_{p : (p, y) \in \bar{L}_t} 2^{-\ell(p)} \geq 0.$$

Now $\bar{L}_t$ is prefix free, so by Kraft's inequality $\sum_{y \in \mathbb{B}^*} P(y, t) \leq 1$ for all $t \in \mathbb{N}$. We
now show that $P(y, t + 1) \geq P(y, t)$ for all $y \in \mathbb{B}^*$ and $t \in \mathbb{N}$ which proves that
$P(y) = \lim_{t\to\infty} P(y, t)$ exists and is a semi-distribution.
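
The passage from $L_t$ to its prefix free subset and the function $P(y, t)$ can be reproduced on the worked example from the text. This is a minimal sketch; the names `shortest_prefix_free` and `P` are chosen for illustration, and $L_t$ is represented as a Python list of (program, output) pairs.

```python
from fractions import Fraction


def shortest_prefix_free(L_t):
    """Keep (p, y) only if no pair in L_t has a program that is a proper
    prefix of p."""
    return [(p, y) for p, y in L_t
            if not any(q != p and p.startswith(q) for q, _ in L_t)]


def P(y: str, L_t) -> Fraction:
    """P(y, t): sum of 2^(-len(p)) over the surviving programs that output y."""
    return sum((Fraction(1, 2 ** len(p))
                for p, out in shortest_prefix_free(L_t) if out == y),
               Fraction(0))


# The example from the text.
L_t = [("1", "001"), ("11", "001"), ("01", "11110"), ("010", "11110")]
L_t1 = L_t + [("0", "11110")]

print(shortest_prefix_free(L_t))
print(shortest_prefix_free(L_t1))
print(P("11110", L_t), P("11110", L_t1))
```

In this example adding the shorter program `0` replaces `01`, so $P(11110, \cdot)$ increases, which is the monotonicity the case analysis in the proof establishes in general.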

Let $(p, y)$ be the program/output pair in $L_{t+1}$ but not in $L_t$. To see how $P(\cdot, t)$
compares to $P(\cdot, t + 1)$ we need to compare $\bar{L}_t$ and $\bar{L}_{t+1}$. There are three cases:

1. There exists a $(q, x) \in \bar{L}_t$ with $q \sqsubset p$. In this case $\bar{L}_{t+1} = \bar{L}_t$.



4 This is where the problem lies for partially recursive prediction functions. Computing the
smallest $i$ for which $f$ predicts incorrectly is incomputable if $f$ is only partially recursive, but
computable if it is totally recursive. It is this distinction that allows $L$ to be recursively enumerable,
and so be a machine.



2. There does not exist a $(q, x) \in \bar{L}_t$ such that $p \sqsubset q$. In this case $(p, y)$ is simply
added to $\bar{L}_t$ to get $\bar{L}_{t+1}$ and so $\bar{L}_t \subset \bar{L}_{t+1}$. Therefore $P(\cdot, t + 1) \geq P(\cdot, t)$ is
clear.

3. There does exist a $(q, x) \in \bar{L}_t$ such that $p \sqsubset q$. In this case $\bar{L}_{t+1}$ differs from
$\bar{L}_t$ in that it contains $(p, y)$ but not $(q, x)$. Since $p \sqsubset q$ we have that $y = x$.
Therefore $P(y, t + 1) - P(y, t) = 2^{-\ell(p)} - 2^{-\ell(q)} > 0$ since $p \sqsubset q$. For other
values, $P(\cdot, t) = P(\cdot, t + 1)$.

Note that it is not possible that $p = q$ since then $x = y$ and duplicates are not added
to $L$. Therefore $P$ is an enumerable semi-distribution. By Theorem 8 we have

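$$m(\omega_{<n_i} \neg\omega_{n_i}) \stackrel{\times}{\geq} P(\omega_{<n_i} \neg\omega_{n_i}) \qquad (3)$$

where the constant multiplicative fudge factor in the $\stackrel{\times}{\geq}$ is independent of $i$. Suppose
$\omega_{<n_i} \neg\omega_{n_i} \in U_M(p)$. Therefore there exists a $y$ such that $\omega_{<n_i} \neg\omega_{n_i} \sqsubseteq y$ and $(p, y) \in
U_M$. By parts 2 and 3 of the process above, $(p, \omega_{<n_i} \neg\omega_{n_i})$ is added to $L$. Therefore
there exists a $T \in \mathbb{N}$ such that $(p, \omega_{<n_i} \neg\omega_{n_i}) \in L_t$ for all $t \geq T$.

Since $\omega_{<n_i} \neg\omega_{n_i} \in U_M(p)$, there does not exist a $q \sqsubset p$ with $\omega_{<n_i} \neg\omega_{n_i} \in U_M(q)$.
Therefore eventually, $(p, \omega_{<n_i} \neg\omega_{n_i}) \in \bar{L}_t$ for all $t \geq T$. Since every program in $U_M$
for $\omega_{<n_i} \neg\omega_{n_i}$ is also a program in $L$, we get

$$\lim_{t\to\infty} P(\omega_{<n_i} \neg\omega_{n_i}, t) \equiv P(\omega_{<n_i} \neg\omega_{n_i}) = M(\omega_{<n_i} \neg\omega_{n_i}).$$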

t— >oo 



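Next,

$$M_{norm}(\neg\omega_{n_i} | \omega_{<n_i}) \equiv \frac{M(\omega_{<n_i} \neg\omega_{n_i})}{M(\omega_{<n_i} \omega_{n_i}) + M(\omega_{<n_i} \neg\omega_{n_i})} \qquad (4)$$

$$\stackrel{\times}{\leq} \frac{m(\omega_{<n_i} \neg\omega_{n_i})}{M(\omega_{1:n_i})} \qquad (5)$$

$$\stackrel{\times}{=} \frac{m(\omega_{1:n_i})}{M(\omega_{1:n_i})} \qquad (6)$$

where Equation (4) follows by the definition of $M_{norm}$. Equation (5) follows from
Equation (3) and algebra. Equation (6) follows since $m(xb) \stackrel{\times}{=} 2^{-K(xb)} \stackrel{\times}{=} 2^{-K(x \neg b)} \stackrel{\times}{=}
m(x \neg b)$, which is Theorem 8. However, by Theorem 9, $\lim_{n\to\infty} \frac{m(\omega_{<n})}{M(\omega_{<n})} = 0$ and so
$\lim_{i\to\infty} M_{norm}(\neg\omega_{n_i} | \omega_{<n_i}) = 0$. Therefore $\lim_{i\to\infty} M_{norm}(\omega_{n_i} | \omega_{<n_i}) = 1$ as required.
□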

We have remarked already that Theorem 10 is likely not valid if $f$ is permitted
to be a partial recursive function that only outputs on sequences for which it makes
a prediction. However, there is a class of predictors larger than the totally recursive
ones of Theorem 10, which $M_{norm}$ still learns.



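Theorem 11. Let $f : \mathbb{B}^* \to \mathbb{B} \cup \{\epsilon\}$ be a partial recursive function and $\omega \in \mathbb{B}^\infty$
satisfying

1. $f(\omega_{<n})$ is defined for all $n$.

2. $f(\omega_{<n}) = \omega_n$ whenever $f(\omega_{<n}) \neq \epsilon$.

If $f(\omega_{<n_i}) \in \mathbb{B}$ for an infinite sequence $n_1, n_2, n_3, \cdots$ then

$$\lim_{i\to\infty} M_{norm}(\omega_{n_i} | \omega_{<n_i}) = 1.$$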

The difference between this result and Theorem 10 is that $f$ need only be defined
on all prefixes of at least one $\omega \in \mathbb{B}^\infty$ and not everywhere in $\mathbb{B}^*$. This allows for a
slightly broader class of predictors. For example, let $\omega = p^1 b^1 p^2 b^2 p^3 b^3 \cdots$ where $p^i$
is some prefix machine that outputs at least one bit and $b^i$ is the first bit of that
output. Now there exists a computable $f$ such that $f(p^1 b^1 \cdots p^{i-1} b^{i-1} p^i) = b^i$ for all
$i$ and $f(\omega_{<n}) = \epsilon$ whenever $\omega_n \neq b^i$ for some $i$ ($f$ only tries to predict the outputs).
By Theorem 11, $M_{norm}$ will correctly predict $b^i$.

The proof of Theorem 11 is almost identical to that of Theorem 10, but with 
one additional subtlety. 

Proof sketch. The proof follows that of Theorem 10 until the construction of $L$.
This breaks down because step 3 is no longer computable since $f$ may not halt on
some string that is not a prefix of $\omega$. The modification is to run steps 2-4 in parallel
for all $t$, only adding $(p, y_{1:i})$ to $L$ once it has been proven that $f(y_{<i}) \neq y_i$ and
$f(y_{<k})$ halts for all $k < i$, and either chooses not to predict (outputs $\epsilon$), or predicts
correctly. Since $f$ halts on all prefixes of $\omega$, this does not change $L$ for any programs
we care about and the remainder of the proof goes through identically.

It should be noted that this new class of predictors is still less general than allowing
$f$ to be an arbitrary partial recursive predictor. For example, a partial recursive $f$
can predict the ones of the halting sequence, while choosing not to predict the zeros
(the non-halting programs). It is clear this cannot be modified into a computable
$f$ predicting both ones and zeros, or predicting ones and outputting $\epsilon$ rather than
zero, as this would solve the halting problem.

4 M Fails to Predict Selected Bits 

The following theorem is the corresponding negative result that while the conditional
distribution of $M_{norm}$ converges to 1 on recursive sub-patterns, $M$ can fail to do so.

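Theorem 12. Let $f : \mathbb{B}^* \to \mathbb{B} \cup \{\epsilon\}$ be the total recursive function defined by,

$$f(x) := \begin{cases} x_{\ell(x)} & \text{if } \ell(x) \text{ odd} \\ \epsilon & \text{otherwise} \end{cases}$$

There exists an infinite string $\omega \in \mathbb{B}^\infty$ with $\omega_{2n} = f(\omega_{<2n}) = \omega_{2n-1}$ for all $n \in \mathbb{N}$
such that

$$\liminf_{n\to\infty} M(\omega_{2n} | \omega_{<2n}) < 1.$$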

The proof requires some lemmas.

Lemma 13. $M(xy)$ can be bounded as follows.

$$2^{K(\ell(x))} M(y) \stackrel{\times}{\geq} M(xy) \stackrel{\times}{\geq} M(y)\, 2^{-K(x)}. \qquad (7)$$

Proof. Both inequalities are proven relatively easily by normal methods as used in 
[LV08] and elsewhere. Nevertheless we present them as a warm-up to the slightly 
more subtle proof later. 

Now construct monotone machine $L$, which we should think of as taking two
programs as input. The first, a prefix program $p$, the output of which we view
as a natural number $n$. The second, a monotone program. We then simulate the
monotone machine and strip the first $n$ bits of its output. $L$ is formally defined as
follows.

1. $L := \emptyset$, $t := 1$

2. Let $(p, n), (q, y)$ be the $t$th pair of program/outputs in $U_P \times U_M$, which is
enumerable.

3. If $\ell(y) \geq n$ then add $(pq, y_{n+1:\ell(y)})$ to $L$

4. $t := t + 1$ and go to step 2

By construction, $L$ is enumerable and is a monotone machine. Note that if $xy \in
U_M(q)$ and $\ell(x) \in U_P(p)$ then $y \in L(pq)$. Now,
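
The stripping machine can be sketched on finite stand-ins for the two universal machines. In the code below, `U_P` is a hypothetical finite list of (prefix program, natural number) pairs and `U_M` a hypothetical finite list of (monotone program, output) pairs; the Cartesian product over finite lists replaces the enumeration of $U_P \times U_M$, and `strip_machine` is an illustrative name.

```python
from itertools import product


def strip_machine(U_P, U_M):
    """For each ((p, n), (q, y)) with len(y) >= n, emit the pair
    (pq, y with its first n bits removed), as in step 3."""
    L = []
    for (p, n), (q, y) in product(U_P, U_M):
        if len(y) >= n:
            pair = (p + q, y[n:])      # y_{n+1 : len(y)}
            if pair not in L:
                L.append(pair)
    return L


# Hypothetical finite fragments of the universal machines.
U_P = [("0", 2), ("10", 1)]            # prefix programs encoding numbers
U_M = [("1", "1101"), ("01", "0")]     # monotone programs with outputs

print(strip_machine(U_P, U_M))
```

Concatenating `p + q` is unambiguous to decode because the programs of `U_P` are prefix free, which is why the first program must be a prefix program rather than a monotone one.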

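$$M(y) \stackrel{\times}{\geq} M_L(y) \equiv \sum_{r : y \in L(r)} 2^{-\ell(r)} \geq \sum_{q, p : xy \in U_M(q),\, \ell(x) \in U_P(p)} 2^{-\ell(pq)} \qquad (8)$$

$$= \sum_{q : xy \in U_M(q)} 2^{-\ell(q)} \sum_{p : \ell(x) \in U_P(p)} 2^{-\ell(p)} \equiv M(xy)\, m(\ell(x)) \qquad (9)$$

$$\stackrel{\times}{=} M(xy)\, 2^{-K(\ell(x))} \qquad (10)$$

where Equation (8) follows by Theorem 6, definitions and because if $xy \in U_M(q)$
and $\ell(x) \in U_P(p)$ then $y \in L(pq)$. Equation (9) by algebra, definitions. Equation
(10) by Theorem 8.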

The second inequality is proved similarly. We define a machine $L$ as follows,

1. $L = \emptyset$, $t := 1$

2. Let $(q, x), (r, y)$ be the $t$th element in $U_P \times U_M$, which is enumerable.

3. Add $(qr, xy)$ to $L$

4. $t := t + 1$ and go to step 2

It is easy to show that $L$ is monotone by using the properties of $U_P$ and $U_M$. Now,

$$M(xy) \stackrel{\times}{\geq} M_L(xy) \equiv \sum_{p : xy \in L(p)} 2^{-\ell(p)} \geq \sum_{q, r : x \in U_P(q),\, y \in U_M(r)} 2^{-\ell(qr)}$$

$$= \sum_{q : x \in U_P(q)} 2^{-\ell(q)} \sum_{r : y \in U_M(r)} 2^{-\ell(r)} \equiv m(x)\, M(y) \stackrel{\times}{=} 2^{-K(x)} M(y).$$

□

Lemma 14. There exists an $\omega \in \mathbb{B}^\infty$ such that

$$\liminf_{n\to\infty} \left[ M(0 | \omega_{<n}) + M(1 | \omega_{<n}) \right] = 0.$$

Proof. First we show that for each $\delta > 0$ there exists a $z \in \mathbb{B}^*$ such that $M(0|z) +
M(1|z) < \delta$. This result is already known and is left as an exercise (4.5.6) with a
proof sketch in [LV08]. For completeness, we include a proof. Recall that $M(\cdot, t)$ is
the function approximating $M(\cdot)$ from below. Fixing an $n$, define $z \in \mathbb{B}^*$ inductively
as follows.

1. $z := \epsilon$

2. Let $t$ be the first natural number such that $M(zb, t) > 2^{-n}$ for some $b \in \mathbb{B}$.

3. If $t$ exists then $z := zb$ and repeat step 2. If $t$ does not exist then $z$ is left
unchanged (forever).

Note that $z$ must be finite since each time it is extended, $M(zb, t) > 2^{-n}$. Therefore
$M(z \neg b, t) < M(z, t) - 2^{-n}$ and so each time $z$ is extended, the value of $M(z, t)$
decreases by at least $2^{-n}$ so eventually $M(zb, t) < 2^{-n}$ for all $b \in \mathbb{B}$. Now once the
$z$ is no longer being extended ($t$ does not exist in step 3 above) we have

$$M(z0) + M(z1) < 2^{1-n}. \qquad (11)$$

However we can also show that $M(z) \stackrel{\times}{\geq} 2^{-K(n)}$. The intuitive idea is that the
process above requires only the value of $n$, which can be encoded in $K(n)$ bits.
More formally, let $p$ be such that $n \in U_P(p)$ and note that the following set is
recursively enumerable (but not recursive) by the process above.

$$L_p := (p, \epsilon), (p, z_{1:1}), (p, z_{1:2}), (p, z_{1:3}), \cdots, (p, z_{1:\ell(z)-1}), (p, z_{1:\ell(z)}).$$






Now take the union of all such sets, which is a) recursively enumerable since $U_P$ is,
and b) a monotone machine because $U_P$ is a prefix machine.

$$L := \bigcup_{(p, n) \in U_P} L_p.$$

Therefore

$$M(z) \stackrel{\times}{\geq} M_L(z) \geq 2^{-K(n)} \qquad (12)$$

where the first inequality is from Theorem 6 and the second follows since if $n^*$ is
the program of length $K(n)$ with $U_P(n^*) = n$ then $(n^*, z_{1:\ell(z)}) \in L$. Combining
Equations (11) and (12) gives

$$M(0|z) + M(1|z) \stackrel{\times}{\leq} 2^{1-n+K(n)}.$$

Since this tends to zero as $n$ goes to infinity,$^5$ for each $\delta > 0$ we can construct a
$z \in \mathbb{B}^*$ satisfying $M(0|z) + M(1|z) < \delta$, as required. For the second part of the
proof, we construct $\omega$ by concatenation.

$$\omega := z^1 z^2 z^3 \cdots$$

where $z^n \in \mathbb{B}^*$ is chosen such that,

$$M(0 | z^n) + M(1 | z^n) < \delta_n \qquad (13)$$

with $\delta_n$ to be chosen later. Now,

$$M(b | z^1 \cdots z^n) \equiv \frac{M(z^1 \cdots z^n b)}{M(z^1 \cdots z^n)} \qquad (14)$$

$$\stackrel{\times}{\leq} 2^{K(\ell(z^1 \cdots z^{n-1})) + K(z^1 \cdots z^{n-1})}\, \frac{M(z^n b)}{M(z^n)} \qquad (15)$$

$$\equiv 2^{K(\ell(z^1 \cdots z^{n-1})) + K(z^1 \cdots z^{n-1})}\, M(b | z^n) \qquad (16)$$

where Equation (14) is the definition of conditional probability. Equation (15) follows
by applying Lemma 13 with $x = z^1 z^2 \cdots z^{n-1}$ and $y = z^n$ or $z^n b$. Equation
(16) is again the definition of conditional probability. Now let

$$\delta_n := \frac{2^{-n}}{2^{K(\ell(z^1 \cdots z^{n-1})) + K(z^1 \cdots z^{n-1})}}.$$

Combining this with Equations (13) and (16) gives

$$M(0 | z^1 \cdots z^n) + M(1 | z^1 \cdots z^n) \stackrel{\times}{\leq} 2^{-n}.$$

5 An integer $n$ can easily be encoded in $2 \log n$ bits, so $K(n) \leq 2 \log n + c$ for some $c > 0$
independent of $n$.






Therefore,

$$\liminf_{n\to\infty} \left[ M(0 | \omega_{<n}) + M(1 | \omega_{<n}) \right] = 0$$

as required. □

Proof of Theorem 12. Let $\bar\omega \in \mathbb{B}^\infty$ be defined by $\bar\omega_{2n} := \bar\omega_{2n-1} := \omega_n$ where $\omega$ is the
string defined in the previous lemma. Recall $U_M := \{(p^1, y^1), (p^2, y^2), \cdots\}$ is the
universal monotone machine. Define monotone machine $L$ by the following process,

1. $L = \emptyset$, $t = 1$

2. Let $(p, y)$ be the $t$th element in the enumeration of $U_M$

3. Add $(p, y_1 y_3 y_5 y_7 \cdots)$ to $L$

4. $t := t + 1$ and go to step 2.

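Therefore if $\bar\omega_{<2n} \in U_M(p)$ then $\omega_{1:n} \in L(p)$. By identical reasoning as elsewhere,

$$M(\omega_{1:n}) \stackrel{\times}{\geq} M(\bar\omega_{<2n}). \qquad (17)$$

In fact, $M(\omega_{1:n}) \stackrel{\times}{=} M(\bar\omega_{<2n})$, but this is unnecessary. Let $P :=
\{p : \exists b \in \mathbb{B} \text{ s.t. } \omega_{1:n} b \in U_M(p)\}$ and $Q := \{p : \omega_{1:n} \in U_M(p)\} \supset P$. Therefore

$$1 - M(0 | \omega_{1:n}) - M(1 | \omega_{1:n}) = \frac{\sum_{p \in Q - P} 2^{-\ell(p)}}{\sum_{q \in Q} 2^{-\ell(q)}}.$$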

Now let $\bar{P} := \{p : \exists b \in \mathbb{B} \text{ s.t. } \bar\omega_{<2n} b \in U_M(p)\}$ and $\bar{Q} := \{p : \bar\omega_{<2n} \in U_M(p)\} \supset \bar{P}$.
Define monotone machine $L$ by the following process

1. $L = \emptyset$, $t := 1$

2. Let $(p, y)$ be the $t$th program/output pair in $U_M$

3. Add $(p, y_1 y_1 y_2 y_2 \cdots y_{\ell(y)-1} y_{\ell(y)-1} y_{\ell(y)})$ to $L$

4. $t := t + 1$ and go to step 2.
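
The two output transformations used by the machines in this proof are easy to state in code. The sketch below, with hypothetical helper names `odd_bits` and `double_bits`, implements $y \mapsto y_1 y_3 y_5 \cdots$ and $y \mapsto y_1 y_1 y_2 y_2 \cdots y_{\ell(y)-1} y_{\ell(y)-1} y_{\ell(y)}$ on Python strings and shows that the first undoes the second.

```python
def odd_bits(y: str) -> str:
    """Keep the bits at odd positions: y_1 y_3 y_5 ..."""
    return y[0::2]


def double_bits(y: str) -> str:
    """Double every bit except the last:
    y_1 y_1 y_2 y_2 ... y_{l-1} y_{l-1} y_l."""
    if not y:
        return ""
    return "".join(b + b for b in y[:-1]) + y[-1]


prefix = "101"                       # plays the role of omega_{1:n}, n = 3
doubled = double_bits(prefix)        # plays the role of bar-omega_{<2n}
print(doubled, odd_bits(doubled))
```

A string of length $n$ is mapped to one of length $2n - 1$, which matches the correspondence between $\omega_{1:n}$ and $\bar\omega_{<2n}$ used in the proof.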

Let $p \in Q - P$. Therefore $\omega_{1:n} \in U_M(p)$ and $\omega_{1:n} b \notin U_M(p)$ for any $b \in \mathbb{B}$. Therefore
$\bar\omega_{<2n} \in L(p)$ while $\bar\omega_{<2n} b \notin L(p)$ for any $b \in \mathbb{B}$. Now there exists an $i$ such that $L$ is
the $i$th machine in the enumeration of monotone machines, $L^i$.
Therefore, by the definition of the universal monotone machine $U_M$ we have
that $\bar\omega_{<2n} b \notin U_M(i'p) = L^i(p) = L(p) \ni \bar\omega_{<2n}$ and $U_M(i'p) = L(p)$ for any $b \in \mathbb{B}$.
Therefore $i'p \in \bar{Q} - \bar{P}$ and so,

$$\sum_{q \in \bar{Q} - \bar{P}} 2^{-\ell(q)} \geq \sum_{p : i'p \in \bar{Q} - \bar{P}} 2^{-\ell(i'p)} \geq \sum_{p \in Q - P} 2^{-\ell(i'p)} \stackrel{\times}{=} \sum_{p \in Q - P} 2^{-\ell(p)}. \qquad (18)$$

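Therefore

$$1 - M(0 | \bar\omega_{<2n}) - M(1 | \bar\omega_{<2n}) = \frac{\sum_{p \in \bar{Q} - \bar{P}} 2^{-\ell(p)}}{\sum_{q \in \bar{Q}} 2^{-\ell(q)}} \qquad (19)$$

$$\stackrel{\times}{\geq} \frac{\sum_{p \in Q - P} 2^{-\ell(p)}}{\sum_{q \in Q} 2^{-\ell(q)}} \qquad (20)$$

$$= 1 - M(0 | \omega_{1:n}) - M(1 | \omega_{1:n}) \qquad (21)$$

where Equation (19) follows from the definition of $\bar{P}$, $\bar{Q}$ and $M$. Equation (20) by
(18) and (17). Equation (21) by the definition of $P$, $Q$ and $M$. Therefore by Lemma
14 we have

$$\limsup_{n\to\infty} \left[ 1 - M(0 | \bar\omega_{<2n}) - M(1 | \bar\omega_{<2n}) \right] \stackrel{\times}{\geq} \limsup_{n\to\infty} \left[ 1 - M(0 | \omega_{1:n}) - M(1 | \omega_{1:n}) \right] = 1.$$

Therefore $\liminf_{n\to\infty} M(\bar\omega_{2n} | \bar\omega_{<2n}) < 1$ as required. □

Note that $\lim_{n\to\infty} M(\bar\omega_{2n} | \bar\omega_{<2n}) \neq 0$; in fact, one can show that there exists a
$c > 0$ such that $M(\bar\omega_{2n} | \bar\omega_{<2n}) > c$ for all $n \in \mathbb{N}$. In this sense $M$ can still be used to
predict in the same way as $M_{norm}$, but it will never converge as in Equation (1).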

5 Discussion 

Summary. Theorem 10 shows that if an infinite sequence contains a computable
sub-pattern then the normalised universal semi-measure $M_{norm}$ will eventually
predict it. This means that Solomonoff's normalised version of induction is effective in
the classification example given in the introduction. Note that we have only proven
the binary case, but expect the proof will go through identically for an arbitrary finite
alphabet.

On the other hand, Theorem 12 shows that plain M can fail to predict such 
structure in the sense that the conditional distribution need not converge to 1 on 
the true sequence. This is because it is not a proper measure, and does not converge 
to one. These results are surprising since (all?) other predictive results, including 
Equation (1) and many others in [Hut04, Hut07, LV08, Sol64a], do not rely on 
normalisation. 

Consequences. We have shown that $M_{norm}$ can predict recursive structure in
infinite strings that are incomputable (even stochastically so). These results give
hope that a Solomonoff inspired algorithm may be effective at online classification,
even when the training data is given in a completely unstructured way. Note that
while $M$ is enumerable and $M_{norm}$ is only approximable,$^6$ both the conditional
distributions are only approximable, which means it is no harder to predict using
$M_{norm}$ than $M$.

Open Questions. A number of open questions were encountered in writing this 
paper. 

1. Extend Theorem 10 to the stochastic case where a sub-pattern is generated 
stochastically from a computable distribution rather than merely a computable 
function. It seems likely that a different approach will be required to solve this 
problem. 

2. Another interesting question is to strengthen the result by proving a convergence
rate. It may be possible to prove that under the same conditions as
Theorem 10 that $\sum_{i=1}^{\infty} \left[ 1 - M_{norm}(\omega_{n_i} | \omega_{<n_i}) \right] \stackrel{\times}{\leq} K(f)$ where $K(f)$ is the (prefix)
complexity of the predicting function $f$. Again, if this is even possible, it
will likely require a different approach.

3. Prove or disprove the validity of Theorem 10 when the totally recursive pre- 
diction function / (or the modified predictor of Theorem 11) is replaced by a 
partially recursive function. 

Acknowledgements. We thank Wen Shao and reviewers for valuable feedback 
on earlier drafts and the Australian Research Council for support under grant 
DP0988049. 



6 A function $f$ is approximable if there exists a computable function $f(\cdot, t)$ with $\lim_{t\to\infty} f(\cdot, t) =
f(\cdot)$. Convergence need not be monotonic.
A Table of Notation

Symbol                              Description
$\mathbb{B}$                        Binary symbols, 0 and 1
$\mathbb{Q}$                        Rational numbers
$\mathbb{N}$                        Natural numbers
$\mathbb{B}^*$                      The set of all finite binary strings
$\mathbb{B}^\infty$                 The set of all infinite binary strings
$x, y, z$                           Finite binary strings
$\omega$                            An infinite binary string
$\bar\omega$                        An infinite binary string with even bits equal to preceding odd bits
$\ell(x)$                           The length of binary string $x$
$\neg b$                            The negation of binary symbol $b$. $\neg b = 0$ if $b = 1$ and $\neg b = 1$ if $b = 0$
$p, q$                              Programs
$\mu$                               An enumerable semi-measure
$M$                                 The universal enumerable semi-measure
$M_{norm}$                          The normalised version of the universal enumerable semi-measure
$m$                                 The universal enumerable semi-distribution
$K(f)$                              The prefix Kolmogorov complexity of a function $f$
$L$                                 An enumeration of program/output pairs defining a machine
$U_M$                               The universal monotone machine
$U_P$                               The universal prefix machine
$\stackrel{\times}{\geq}$           $f(x) \stackrel{\times}{\geq} g(x)$ if there exists a $c > 0$ such that $f(x) \geq c \cdot g(x)$ for all $x$
$\stackrel{\times}{\leq}$           $f(x) \stackrel{\times}{\leq} g(x)$ if there exists a $c > 0$ such that $f(x) \leq c \cdot g(x)$ for all $x$
$\stackrel{\times}{=}$              $f(x) \stackrel{\times}{=} g(x)$ if $f(x) \stackrel{\times}{\geq} g(x)$ and $f(x) \stackrel{\times}{\leq} g(x)$
$x \sqsubset y$                     $x$ is a prefix of $y$ and $\ell(x) < \ell(y)$
$x \sqsubseteq y$                   $x$ is a prefix of $y$